1. Introduction

1.1 Background

Ridership is one of the most important measures of a transit system's performance and a key input to transportation planning. Even during the pandemic, when falling ridership became a nightmare for every transit agency in the US, it remains critical to explore how transit ridership relates to the built environment, land use, demographics, and system dynamics. Austin's bus ridership data, collected by Automatic Passenger Counter (APC) sensors, gives planners a great opportunity to assess system performance and to back up better decisions with predictions.

In 2018, the Capital Metropolitan Transportation Authority (CapMetro), a public transportation agency serving Austin, Travis County, and parts of Williamson County, launched “Cap Remap”, a bus system redesign project, as part of its transit development plan, Connections 2025. Cap Remap adjusted the transit network based on internal analysis and community outreach, aiming to provide a more frequent, more reliable, and better connected bus system. Specifically, it remapped certain routes, tripled the number of bus routes that operate every 15 minutes, and matched weekend frequency to demand. This project offers an opportunity to understand what factors influence bus ridership.

1.2 Use Case

Given the renewed interest in bus transit in US cities such as Austin, there is an opportunity to streamline the bus planning process using modern data science methods. Currently, cities have to gather information on land use, the built environment, demographics, and so on from many different sources in order to understand how bus ridership might change in the future. This process is usually time-consuming and labor-intensive. Cities often have to outsource the analysis to third parties, which inevitably raises project costs. The goal of this report, therefore, is to present a scenario planning tool that lets planners test how changes in local land use and bus route characteristics affect predicted bus ridership. If such a predictive model proves robust, planners in Austin can use it to evaluate a range of possible futures for land use development and bus route changes and to make strategic decisions efficiently.

The rest of this report is broken into four sections. Section 2 presents an exploratory analysis of Cap Remap to better understand the trends, patterns, and characteristics of ridership in Austin, which helps determine the important features to incorporate in the predictive model. Section 3 explains the process of model building and model evaluation. Section 4 demonstrates the user interface of the bus network planning application, which is supported by the model developed in Section 3. The last section is an appendix showing the code and additional information about the model and application development.

1.3 What Data Are We Using?

The ridership data we used come from Automatic Passenger Counters (APC), which count the number of passengers boarding and alighting on any given bus. The image below illustrates the APC system at work.

Thanks to the Cap Remap project and the ridership and bus system data collected for it, we were able to obtain the average daily weekday ridership for 2019 and use it as the dependent variable.

The dataset also provides information on route characteristics, such as route types and high-ridership lines, which we call hotlines.

From multiple open data platforms, we also retrieved built environment data, such as building areas and amenities.

Finally, the US Census provides comprehensive demographic data, such as population, vehicle ownership, and median income.

2. Exploratory Analysis

Before diving into model building, it is crucial to have a good grasp of the characteristics and anatomy of bus ridership in Austin in order to build a useful ridership prediction application for planners. This section investigates Austin’s ridership data from the APC system and answers the following questions: How did ridership change before and after the implementation of Cap Remap (06/03/2018)? How does ridership vary across the city? Which route characteristics influence ridership? What are the popular bus routes in Austin, and what are their attributes?

2.1 Annual Citywide Ridership Trend

How did ridership change before and after Cap Remap (06/03/2018)? Did it increase after the redesign?

The data currently available from Capital Metro allow us to observe the trend in ridership before and after Cap Remap. The first step of the exploratory analysis is to look at the city-wide change in ridership brought about by Cap Remap. To show the overall trend before and after the redesign, we use monthly ridership data to create a system-wide regression discontinuity chart.

The x-axis represents months (June is marked as 0 because the redesign happened in that month), and the y-axis represents average daily ridership in the given month. Color distinguishes the period (before or after the redesign), while line type distinguishes years. Comparing 2018 and 2019 from January to June (-6 to 0 in the chart), system-wide ridership generally increased after the redesign. In the short term, however, ridership actually declined in June and July 2018, just after the redesign took place. This decrease is reasonable, as passengers normally need some time to adjust to a new schedule.

2.2 Ridership Pattern in Subdivisions

Having looked at the city-wide trend, the next question is how ridership changed across the city: which areas experienced an increase in ridership and which experienced a decrease. We first look at ridership patterns by general typology to understand the overall trend, and then use Austin's neighborhoods to show the spatial trend.

We first mapped average ridership by typology. The UT and CBD regions have higher ridership than the rest of the city.

We then mapped ridership by neighborhood. The following maps show the average daily ridership in each neighborhood on a given weekday. As expected, the downtown and UT Austin neighborhoods (in the middle of the map) have high ridership, shown in dark blue. Several other neighborhoods on the north and southeast sides of Austin also have high ridership. By contrast, the west side and the outskirts of Austin have lower ridership. This is one outcome of a bus system that mainly focuses on north-south connections.

We then created charts showing the ridership change in each neighborhood between June and September 2018. Twelve neighborhoods experienced a ridership decrease from June to September, while several experienced a large increase of more than 10,000. In general, most neighborhoods saw ridership increase after Cap Remap from June to September. Among the 78 neighborhoods in Austin, we identified three that represent different characteristics: a neighborhood with an expected ridership increase, a neighborhood with an unexpected ridership increase, and a neighborhood with an unexpected ridership decrease.

Among all the neighborhoods, UT is the one with an expected ridership increase. The UT neighborhood is located just north of the downtown neighborhood. With many university students living nearby, the bus network is sensitive to the school schedule, and ridership follows a relatively clear trend according to school seasons.

The second neighborhood, Govalle, experienced an unexpected ridership increase. After Cap Remap, ridership in Govalle increased by roughly 50% to 75%. Govalle is a neighborhood on the east side of Austin that has been gentrifying in recent years. Artists are converting industrial spaces in the south of the neighborhood into studios and exhibition spaces. The neighborhood's increasing popularity, together with the bus redesign, helps explain the ridership increase.

There are also neighborhoods experiencing a ridership decrease in the east-west direction. Zilker is located on the southwest side of Austin’s downtown region. In Zilker, however, ridership seems to be little affected by the redesign, as it remained stable before and after. The stable ridership is not surprising, since Zilker lies to the south of downtown Austin within walking distance. It is a neighborhood with many parks, green spaces, and leisure facilities, so the demand for bus service is not as strong as in other neighborhoods.

2.3 Route Analysis

2.3.1 Route Type

What could potentially influence ridership in terms of route information?

There are several route types, each serving a different purpose. Our hypothesis is that route type plays an important role in determining ridership.

The Austin bus system comprises nine types of routes. The graphs below show the six main route types. Because Capital Metro is a regional transit agency, its service area extends beyond the City of Austin. Since our analysis and model building focus on Austin, the basemap below outlines only the City of Austin.

Regarding route types, the characteristics are listed below:

Local: Capital Metro’s Local routes are intended to connect specific neighborhoods of Austin to Downtown Austin, with frequent stops.

MetroRapid: Capital Metro’s MetroRapid routes are ostensibly a bus rapid transit service serving high-traffic corridors. Buses run every 15 minutes on weekdays, with 10-minute service at rush hours.

UT Shuttle: The UT Shuttle system includes a number of routes that operate during the University of Texas semester. They do not operate on Saturdays, except during finals.

Crosstown: Capital Metro’s Crosstown routes are local services between two neighborhoods of Austin, for which the route does not pass through Downtown Austin or the University of Texas.

Limited & Flyer: Capital Metro’s Limited and Flyer routes are limited-stop services between two destinations. Limited routes tend to have fewer stops than their Local counterparts, while Flyer routes run nonstop between downtown or the UT campus and the neighborhoods they serve.

Feeder: Capital Metro’s Feeder routes are local services between a neighborhood and a major transfer point for connecting service.

2.3.2 Hotlines

What makes a good bus system? What’s so special about the ‘hotlines’?

The following analysis aims to find out, from a micro perspective, which routes are popular, why they are popular, and how they have changed. K-means cluster analysis was used to separate the disaggregated data into groups. K-means is an unsupervised learning algorithm that partitions observations into a chosen number of clusters by assigning each observation to the cluster whose centroid is nearest in feature space. We use this algorithm to see whether the resulting grouping identifies the hotlines, i.e. the routes that have higher ridership.
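As a rough illustration of how K-means could flag high-ridership routes, the sketch below clusters a small made-up table of routes with base R's `kmeans()`. The data frame, its column names (`avg_riders`, `trips_per_day`), and the choice of two clusters are assumptions for illustration, not the inputs or settings used in this report.

```r
# Hypothetical route-level data: all values are invented for illustration.
routes <- data.frame(
  route         = c("A", "B", "C", "D", "E", "F"),
  avg_riders    = c(5200, 4800, 900, 750, 600, 5100),
  trips_per_day = c(140, 130, 40, 35, 30, 150)
)

# Scale the features so that no single variable dominates the distances.
features <- scale(routes[, c("avg_riders", "trips_per_day")])

set.seed(42)  # K-means starts from random centroids
km <- kmeans(features, centers = 2, nstart = 25)

# Treat the cluster with the higher mean ridership as the "hotline" cluster.
cluster_means <- tapply(routes$avg_riders, km$cluster, mean)
hot_cluster   <- as.integer(names(which.max(cluster_means)))
routes$hotline <- km$cluster == hot_cluster

print(routes[routes$hotline, "route"])
```

Scaling matters here because ridership and trip counts are on very different scales; without it, the clustering would be driven almost entirely by the larger-valued column.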

We ran the K-means analysis both before and after Cap Remap. Four lines were labeled as hotlines before the remap and six after it. The hotlines before and after the remap are plotted below. Most of the hot routes run north-south. Two new hotlines, Line 10 and Line 20, emerged after Cap Remap; they are colored in red.

To dive deeper into the characteristics of the hot bus lines, we mapped the passenger load for each route at each stop in each direction. We also plotted passenger load against stop sequence ID, as well as average boardings and alightings at each stop along each route. The purpose of this analysis is, first, to find out what is special about the hotlines and, second, to see trends before and after Cap Remap. Note that each route in the Austin bus system has several patterns; to keep the plots meaningful, we selected only the most used pattern for each plot. Below we use two routes, Line 20 (High Frequency) and Line 801 (MetroRapid), to demonstrate the detailed route analysis.

Below is the analysis for Line 801.

By mapping and plotting the average number of passengers on board, as well as the average boardings and alightings at each stop, we can better see how a specific location or neighborhood could contribute to ridership. These regions are feature engineered in the following analysis. We also noticed that passenger load tends to be higher in the middle portion of the trip, which means that many passengers board at early stops and ride to stops near the end.
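The relationship between boardings, alightings, and passenger load can be sketched as a running total along the stop sequence. The numbers and column names below are hypothetical; the APC data already reports load directly, so this is only meant to show why load peaks mid-route when riders board early and alight late.

```r
# Hypothetical boardings and alightings along one route pattern, in stop order.
trip <- data.frame(
  stop_seq = 1:6,
  on       = c(12, 8, 5, 2, 1, 0),
  off      = c(0, 1, 2, 6, 9, 10)
)

# Load when leaving each stop = cumulative boardings minus cumulative alightings.
trip$load <- cumsum(trip$on - trip$off)

# Stop sequence at which the bus is fullest.
peak_stop <- trip$stop_seq[which.max(trip$load)]

print(trip)
print(peak_stop)
```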

In conclusion, hotlines have the following characteristics:

  • In terms of bus route types, Local, MetroRapid, and High Frequency routes have high ridership.
  • In terms of geographical distribution, hotlines:
    • Go through hubs (UT, DT, Pleasant Valley)
    • Mostly run north-south (following the shape and geography of the city)
    • Cross a large portion of the city
  • In terms of temporal trend, more shifts were added during the daytime and rush hours, which might increase ridership.

2.4 Other Features in Austin

Besides the route features that are already part of the bus system, other features of the built environment clearly help determine whether an area has high ridership. In this section, three different types of features are explored to further reveal ridership patterns in Austin.

2.4.1 Amenities

Amenities in a city are typically key nodes such as schools, stadiums, and supermarkets. These nodes are destinations that travelers might head to, so they can influence variation in bus ridership. The amenities we investigated here are offices and schools. Since the ridership data is the average daily ridership on a given weekday, offices and schools, as popular destinations for commuters, are worth investigating. As the following maps indicate, most offices cluster in the downtown area, while schools are scattered across every neighborhood. Compared with the ridership map, offices are likely to have a large impact on ridership on the north side of Austin, across the Colorado River. Because schools are fairly evenly distributed across neighborhoods, it is hard to tell which areas are affected more.

2.4.2 Built Environment

Many factors in the built environment could have a strong impact on bus ridership. Based on transit planning experience, land use is an important feature to consider. To reflect the impact of land use on ridership, we selected two land use types: commercial and civic. Our initial priority was commercial and residential land use, but there is so much residential land in Austin that the map makes it hard to reveal any pattern, so the final land use types are commercial and civic. According to the following maps, commercial land use lies mostly along the main corridors in Austin. Although we expected to see more commercial land use in the downtown area, it is actually scattered. Comparing the commercial land use map with the ridership map, every high-ridership neighborhood has relatively more commercial land use. For civic land use, some of the areas are noticeably large compared with commercial land use. In the University of Texas neighborhood, where civic land use occupies a large share of the space, ridership is very high.

2.4.3 Demographics

Lastly, demographics are the final set of features we take into account. Basic characteristics of an area, such as race, income, and age, might influence ridership. Here we use vehicle ownership, more specifically households without a vehicle, as an exploratory feature. Because demographic data is reported by census tract, it reveals more detail than the neighborhood-level ridership map. In most neighborhoods with high ridership, there are census tracts where more households own no vehicle. This underlines the importance of our application: it lets planners see how ridership changes with demographic features, which can improve resource allocation with social equity in mind.

3. Modeling

3.1 Strategies

We build a machine learning model that predicts ridership at each stop. This model allows planners to test different scenarios in which large developments, land use changes, or route frequency changes could substantially affect system ridership. To make the prediction model more accurate and generalizable, we compare how well linear regression, lasso and ridge regression, random forest, and XGBoost capture the variability in our dataset. We start with feature engineering, which covers five major categories: amenities, built environment, demographics, route network, and stop characteristics (internal data). Our hypothesis is that these five categories influence ridership at each stop in different ways. The dependent variable is the average ridership at each stop in 2019. We use 2019 data because we want to focus on the period after Cap Remap, and because our engineered features align better with 2019.

3.2 Feature Engineering

The feature table below demonstrates five types of features and the sources of each feature. All data comes from the following sources: APC aggregated and disaggregated data, Capital Metro, OpenStreetMap (OSM), Open Data Austin, and ACS Census.

In the amenity category, the locations of amenities are collected from OSM. Examples of amenities include stadiums, supermarkets, offices, and train stations. For each stop, we created buffers with radii of 1/2 mile, 1/4 mile, and 1/8 mile and counted the number of amenities of each type within each buffer. To capture the effect of distance between stops and amenities, we also calculated the distance between each stop and its 3 closest amenities.
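A minimal sketch of these two amenity features, assuming stop and amenity locations are already in a projected coordinate system measured in feet. The coordinates, the column names, and the decision to average the three nearest distances into one number are illustrative assumptions rather than the exact procedure used here.

```r
# Hypothetical projected coordinates in feet (e.g. a State Plane CRS).
stops <- data.frame(stop_id = c(1, 2), x = c(0, 5000), y = c(0, 0))
amenities <- data.frame(
  type = c("office", "office", "school", "office", "school"),
  x    = c(300, 1000, 0, 5200, 9000),
  y    = c(400, 0, 2000, 0, 0)
)

buffer_ft <- 0.25 * 5280  # quarter-mile buffer radius in feet

# Euclidean distance matrix: rows are stops, columns are amenities.
dist_mat <- sqrt(outer(stops$x, amenities$x, "-")^2 +
                 outer(stops$y, amenities$y, "-")^2)

# Count the offices that fall within the buffer around each stop.
is_office <- amenities$type == "office"
stops$office_count <- rowSums(dist_mat[, is_office, drop = FALSE] <= buffer_ft)

# Mean distance from each stop to its 3 nearest amenities of any type
# (averaging the three distances is an assumption made for this sketch).
stops$nn3_dist <- apply(dist_mat, 1, function(d) mean(sort(d)[1:3]))

print(stops)
```

With real data the same logic would typically run on `sf` geometries, but plain coordinates keep the sketch free of package dependencies.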

In the built environment category, we include land use types, building area, a neighborhood fixed effect, and a school district fixed effect. Some features, such as the neighborhood and the school district, are spatially joined to the stops. For others, such as land use and building area, we calculate the percentage of each land use type and the total building area within the buffer. Note that all three buffer sizes are tested in order to capture more variation in the dataset.

In the demographics category, we collect data on population, median income, and car ownership. We used areal weighted interpolation to estimate the census values within each buffer and joined the weighted estimates to each stop.
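The idea behind areal weighted interpolation can be sketched with a small table of buffer and tract overlaps: each tract contributes its population in proportion to the share of its area that falls inside the buffer. The overlap areas and populations below are invented, and in practice they would come from a spatial intersection of buffers and tracts.

```r
# Hypothetical overlaps between stop buffers and census tracts.
overlaps <- data.frame(
  stop_id      = c(1, 1, 1, 2),
  tract_pop    = c(4000, 2500, 6000, 3000),
  tract_area   = c(2.0, 1.0, 3.0, 1.5),     # square miles
  overlap_area = c(0.10, 0.05, 0.03, 0.15)  # tract area inside the buffer
)

# Share of each tract that lies inside the buffer.
overlaps$weight <- overlaps$overlap_area / overlaps$tract_area

# Counts (extensive variables) are scaled by that share, then summed per stop.
overlaps$pop_part <- overlaps$tract_pop * overlaps$weight
buffer_pop <- aggregate(pop_part ~ stop_id, data = overlaps, FUN = sum)

print(buffer_pop)
```

Counts such as population are summed this way, whereas a rate or average such as median income would instead need an area-weighted mean. The `sf` package provides `st_interpolate_aw()` for running this kind of interpolation directly on geometries.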

For route information, we calculate the percentage of each route type passing through each bus stop, as well as the number of shifts going through each stop in a given week.

For the internal (stop characteristics) category, we first added a transit hub dummy defined by Capital Metro; we then calculated the spatial lag, which is the average ridership of the surrounding stops.
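One way to compute such a spatial lag is to average the ridership of each stop's nearest neighbors, excluding the stop itself. In the sketch below the stop coordinates, the ridership values, and the choice of `k = 2` neighbors are all assumptions; the text does not specify how "surrounding stops" were defined.

```r
# Hypothetical stops with projected coordinates and average ridership.
stops <- data.frame(
  stop_id   = 1:5,
  x         = c(0, 100, 200, 5000, 5100),
  y         = c(0, 0, 0, 0, 0),
  ridership = c(50, 70, 90, 10, 30)
)

k <- 2  # assumed number of neighboring stops to average over

# Pairwise distances between stops; a stop is never its own neighbor.
d <- as.matrix(dist(stops[, c("x", "y")]))
diag(d) <- Inf

# Spatial lag = mean ridership of the k nearest stops.
stops$lag_ridership <- apply(d, 1, function(row) {
  mean(stops$ridership[order(row)[1:k]])
})

print(stops)
```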

We conducted a series of analyses to see which features are important. We first looked at the correlation between each feature and the dependent variable, mean ridership in 2019. Below are selected features that correlate strongly with ridership, either positively or negatively. We found that route information, amenity distance, and stop characteristics often correlate highly with ridership.

We also built a correlation matrix to check for potential collinearity among the features and the dependent variable. Certain pairs of features correlate highly with each other: the route direction features SouthNorth and WestEast, the land use features commercial and residential, and the amenity features distance to CBD and distance to train stations. It is important to identify these pairs when selecting features for the model.
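A collinearity check of this kind can be sketched by computing a correlation matrix and listing the feature pairs whose absolute correlation exceeds a threshold. The simulated features and the 0.8 cutoff are assumptions chosen for illustration only.

```r
set.seed(7)
n <- 200

# Simulated features: residential share is built to mirror commercial share.
commercial <- runif(n)
features <- data.frame(
  commercial  = commercial,
  residential = 1 - commercial + rnorm(n, sd = 0.05),
  shifts      = runif(n, 20, 200)
)

cor_mat <- cor(features)

# Keep each pair once (upper triangle) if |r| exceeds the assumed cutoff.
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[high[, "row"]],
  feature_2 = colnames(cor_mat)[high[, "col"]],
  r         = cor_mat[high]
)

print(high_pairs)
```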

3.3 Results and Validation

As mentioned in the exploratory analysis, neighborhood has a large effect on ridership. Similarly, a typology that separates downtown and UT from the rest of Austin captures much of the difference in ridership patterns because of the city’s built environment and school schedules. In the validation section, we therefore use neighborhood and city typology (downtown, UT Austin, and the rest) to test the model’s generalizability.

With the features created in the five categories, we built four types of models: a simple linear regression model (lm in the following visualizations), a lasso and ridge regression model (glmnet), a random forest model (rf), and an XGBoost model (xgb). Each is tested and validated through a generalizability test to see which model fits best.

Apart from the original 1/4 mile buffer, two more buffer sizes were created during feature engineering. The dataset corresponding to each buffer size is tested and validated through the generalizability test, and the best buffer size is selected.

3.3.1 Selecting the Best Model

We tested the generalizability of the four models by holding out one neighborhood at a time, training the model on the remaining data, and calculating the prediction error on the held-out neighborhood.
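The hold-out procedure can be sketched as a leave-one-group-out loop. The example below uses simulated data and a plain `lm()` fit as a stand-in for the four models; the helper name `logo_cv`, the column names, and the MAE and MAPE definitions are assumptions made for this illustration.

```r
set.seed(1)
n <- 120

# Simulated stop-level data; names and coefficients are invented.
stops <- data.frame(
  neighborhood = rep(c("A", "B", "C", "D"), each = n / 4),
  shifts       = runif(n, 20, 200),
  offices      = rpois(n, 3)
)
stops$ridership <- 50 + 0.8 * stops$shifts + 4 * stops$offices +
  rnorm(n, sd = 10)

# Error metrics: mean absolute error and mean absolute percentage error.
mae  <- function(obs, pred) mean(abs(obs - pred))
mape <- function(obs, pred) mean(abs(obs - pred) / obs)

# Hold out each group in turn, train on the rest, score on the held-out group.
logo_cv <- function(data, group_col, formula) {
  groups <- unique(data[[group_col]])
  results <- lapply(groups, function(g) {
    train <- data[data[[group_col]] != g, ]
    test  <- data[data[[group_col]] == g, ]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    obs   <- test[[all.vars(formula)[1]]]
    data.frame(group = g, MAE = mae(obs, pred), MAPE = mape(obs, pred))
  })
  do.call(rbind, results)
}

cv <- logo_cv(stops, "neighborhood", ridership ~ shifts + offices)
print(cv)
```

Because every prediction is made for a neighborhood the model never saw during training, the per-group errors indicate how well the model transfers across space rather than how well it fits the training data.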

The MAPE and MAE of the four models reveal that, generally speaking, the random forest model is the most accurate, while the lasso and ridge model is less accurate than the other three: it has the largest MAPE and MAE, whereas the random forest model has the smallest.

Comparing predicted and actual values, simple linear regression tends to underpredict ridership when actual ridership is higher than 300. The lasso and ridge model generally overpredicts ridership. The random forest model tends to overpredict when actual ridership is over 250. The XGBoost model generally overpredicts ridership but performs better than the glmnet model when actual ridership is low.

To test the generalizability of the models across neighborhoods, the following bar chart shows each model’s MAE by neighborhood. Most of the MAEs are below 100. Several neighborhoods have a higher MAE, most of them clustered around UT.

To take a closer look at the neighborhoods, we map MAPE to show generalizability. As mentioned before, the model’s accuracy is lower in the neighborhoods around the University of Texas.

3.3.2 Selecting the Best Buffer Size

It is hard to decide in advance which buffer size captures the most variation in the dataset. Because bus stops are relatively densely distributed in Austin, we drew on our knowledge of walkable distances, starting at 1/2 mile and gradually reducing the size to 1/4 mile and 1/8 mile. We select the best buffer size by comparing the R-squared, RMSE, MAE, and MAPE of each model across buffer sizes and by further testing the models’ generalizability.

1/2 mile Buffer

For each buffer size, four types of models are compared, and the model with the best performance enters the final comparison. In the 1/2 mile buffer test, the random forest model has the largest R-squared (0.81) and the lowest MAE (64) and MAPE (30.4%), so it performs best among the four models.

1/4 mile Buffer

In the 1/4 mile buffer test, the random forest model has the largest R-squared (0.79) and the lowest MAE (69.3) and MAPE (28.8%). It performs best among the four models.

1/8 mile Buffer

In the 1/8 mile buffer test, the random forest model has the largest R-squared (0.72) and the lowest MAE (77.8) and MAPE (33.6%). It performs best among the four models.

Across the three buffer sizes, the largest R-squared comes from the 1/2 mile buffer, while the lowest MAPE comes from the 1/4 mile buffer. A possible reason the 1/8 mile buffer performs worst is that it cannot capture enough variation within such a limited radius around the stops. Especially in high-ridership areas such as the CBD or UT, much of the variation may lie outside the 1/8 mile radius.

If we then look at how the three buffer sizes generalize across the CBD, UT, and the rest of the city, it is clear that the 1/8 mile buffer does not generalize as well as the other two: its MAE in the CBD and UT is very high. Between the 1/2 mile and 1/4 mile buffers, the MAE indicates better generalizability for the 1/2 mile buffer because of its smaller value in UT. However, considering our use case, in which planners test different developments at specific locations, a smaller buffer reflects real-world development more accurately. Thus, the final buffer radius is set to 1/4 mile.

3.3.3 Validation using Poverty Condition

To further examine our model’s generalizability, we created a poverty context based on the national poverty rate for metro areas: 12.6%. In the following map, areas marked as majority poverty have more than 12.6% of their population below the poverty line. The map shows that Austin is divided into two parts: the north and west are relatively wealthy, while the south and east have more residents in poverty.

According to the following chart, our model is fairly generalizable in terms of MAPE. In the comparison, the XGBoost model generalizes better across poverty groups: its MAE and MAPE are smaller than those of the random forest model. Although the MAPE for the majority poverty group is smaller than for the majority non-poverty group, its higher MAE indicates a risk of bias if the model is used to redesign the system.

4. Application

4.1 Three different scenarios for ridership change

4.1.1 What is the change of transit ridership with route frequency update?

In this scenario, we changed the frequency of all feeder routes to make them high-frequency routes (meaning a headway of less than 15 minutes). The impact is clear: many stops would experience a ridership increase, while only a few stops close to downtown would experience a decrease.

4.1.2 What is the change of transit ridership with new real estate development?

In this scenario, we assume new real estate development on the outskirts of Austin. With the new development, almost no stops see declining ridership.

4.1.3 What is the change of transit ridership with a land use update?

In this scenario, the land use percentages around certain stops are changed: the commercial share is increased while the residential share is decreased. This change leads to a ridership increase at most stops.

4.2 Web Application

We developed a web application according to our use case: helping planners envision the potential change in transit ridership in response to changes in real estate. The following screenshots demonstrate the prototype. The first three screenshots show how users can investigate current ridership patterns, and the last shows that the web app can give ridership projections when the user selects a particular scenario.

5. Appendix

Setup

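######### Set Up Functions and Plotting Options #########
# Packages used by the code in this appendix
library(ggplot2)  # theme(), ggplot(), geom_sf()
library(dplyr)    # %>%, group_by(), summarize(), ntile()
library(sf)       # st_as_sf(), st_transform()
library(ggrepel)  # geom_label_repel()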
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

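qBr <- function(df, variable, rnd) {
  # Quantile breaks, returned as character labels for map legends
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # Unrounded breaks: the probabilities belong to quantile(), not as.character()
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}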

q5 <- function(variable) {as.factor(ntile(variable, 5))}

#plotTheme <- theme(
#  plot.title =element_text(size=12),
#  plot.subtitle = element_text(size=8),
#  plot.caption = element_text(size = 6),
#  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
#  axis.text.y = element_text(size = 10),
#  axis.title.y = element_text(size = 10),
#  panel.background=element_blank(),
#  plot.background=element_blank(),
#  axis.ticks=element_blank())

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=1.5),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

Background

Data Structure

# Turn the data frames into spatial (sf) objects
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

disagg_sf <- disagg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

# We use aggregated data to look at the average ridership on weekdays at individual stops
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = subset(agg_after_sf, STOP_ID == 476), aes(color = "Stop 476"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stop 476" = "darkorange"),
                      guide = guide_legend("Aggregated Data Example"))+
  labs(title = "Aggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  ggrepel::geom_label_repel(
    data = subset(agg_after_sf, STOP_ID == 476),aes(label = "Average Ridership = 33 \n Average Passing Buses = 55", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 3)+mapTheme()

# We use disaggregated data to investigate the average ridership on weekdays on different routes.
disagg_803 <- subset(disagg_sf, ROUTE == 803)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(PSGR_ON),
            avg_load = mean(PSGR_LOAD))
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = disagg_803, aes(color = "Stops on Route 803"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stops on Route 803" = "darkorange"),
                      guide = guide_legend("Disaggregated Data Example"))+
  labs(title = "Disaggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  ggrepel::geom_label_repel(
    data = subset(disagg_803, STOP_ID == 2606),aes(label = "Average On-board Passengers of Stop 2606 = 11 \n Route Type = Metro Rapid", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 0,
    segment.color = "lightgrey",
    point.padding = 20)+mapTheme()

Route Type Changes

# Crosstown
crosstown <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Crosstown Routes Before and After Cap Remap")+mapTheme()

# Feeder
feeder <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Feeder Routes Before and After Cap Remap")+mapTheme()


# Flyer
flyer <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Flyer Routes Before and After Cap Remap")+mapTheme()

# Express
express <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Express Routes Before and After Cap Remap")+mapTheme()

# Special
special <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Special Routes Before and After Cap Remap")+mapTheme()

# Arrange the route types with minor changes in a 2 x 2 grid

grid.arrange(crosstown, feeder, flyer, express, ncol =2)

Exploratory Analysis

Ridership Typology

#create stop shapefile
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

agg_sf19 <- agg_sf%>%
  filter(YEAR_ID == 2019)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(AVERAGE_ON))

#Read UT and CBD shapefiles
UT <- st_read("D:/Spring20/Practicum/data/UTAustin/UT.shp")%>%
  st_transform(2278)
CBD <- st_read("D:/Spring20/Practicum/data/CBD/CBD.shp")%>%
  st_transform(2278)

#create shapefile of area outside of CBD
nhood_CBD <- st_difference(nhood_merge, CBD)

#st_difference() did not work for UT, so this layer was created in ArcMap
nhood_UT <- st_read("D:/Spring20/Practicum/data/nhood_UT.shp")%>%
  st_as_sf()%>%
  st_transform(2278)

#Create CBD typology
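agg_sf19_CBD <- st_join(CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "CBD")

# Summarize to one row per polygon so the columns match agg_sf19_oCBD before rbind()
# (assumes the CBD shapefile carries an Id field, as the UT shapefile does)
agg_sf19_CBD <- agg_sf19_CBD%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "CBD")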

agg_sf19_oCBD <- st_join(nhood_CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "oCBD")%>%
  rename(geometry = x)

agg_sf19_oCBD <- agg_sf19_oCBD%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "The Rest of Austin")

agg_CBD_typology <- rbind(agg_sf19_CBD,agg_sf19_oCBD)

#Create UT typology
agg_sf19_UT <- st_join(UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "UT")

agg_sf19_UT <- agg_sf19_UT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "UT Austin")

agg_sf19_oUT <- st_join(nhood_UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "oUT")%>%
  select(Id, # kept because the next step groups by Id
         STOP_ID,
         avg_on,
         typology,
         geometry)

agg_sf19_oUT <- agg_sf19_oUT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "The Rest of Austin")

agg_UT_typology <- rbind(agg_sf19_UT,agg_sf19_oUT)

How did ridership change before and after CapRemap (June 3, 2018)?

Ridership Change in Different Neighborhoods in Austin in 2018

Identify the Hotlines

First, let us look at the k-means analysis before the CapRemap. We group the disaggregated data by route and calculate the maximum and mean number of passengers on board at each stop, the average miles traveled and average hours spent per passenger at each stop, and the total run length and total run time of the route.

Then, we apply k-means clustering. The number of clusters is determined by both the elbow chart and the 26 criteria provided by the NbClust package. For more information, see the appendix.

We apply the same analysis to the disaggregated dataset after the CapRemap.

The cluster labels are then joined to the original dataset. For more about the clustering results, see the appendix.
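The clustering step described above can be sketched in base R. This is a minimal illustration rather than the code used for the report: the route-level table is synthetic, the two column names are hypothetical stand-ins for the features listed above, and only the elbow chart is shown (the NbClust criteria are not reproduced).

```r
# Synthetic route-level features; column names are hypothetical stand-ins
set.seed(42)
n_per <- 20
routes <- data.frame(
  mean_load  = c(rnorm(n_per, 5, 1), rnorm(n_per, 12, 1), rnorm(n_per, 25, 1)),
  run_length = c(rnorm(n_per, 8, 1), rnorm(n_per, 15, 1), rnorm(n_per, 30, 1))
)

# Standardize so that no single feature dominates the Euclidean distance
routes_scaled <- scale(routes)

# Total within-cluster sum of squares for k = 1..6 (the input to an elbow chart)
wss <- sapply(1:6, function(k) {
  kmeans(routes_scaled, centers = k, nstart = 25)$tot.withinss
})

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Fit the chosen solution and attach the cluster labels to the route table
fit <- kmeans(routes_scaled, centers = 3, nstart = 25)
routes$cluster <- fit$cluster
```

The "elbow" is the value of k after which the within-cluster sum of squares stops dropping sharply. Using several random starts (`nstart`) makes the result less sensitive to the initial centers.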

Find the number of k-means clusters for both before and after the CapRemap:

Both the elbow chart and the 26 indices provided by the NbClust package are used to check how many clusters should be used in the k-means analysis.

Before CapRemap:

After CapRemap:

In either case, the optimal number of clusters for the k-means analysis is 3. We then run k-means with 3 clusters, as described in the exploratory analysis section.

Here are the k-means results for before and after the CapRemap. The numbers are the averages of each feature used in the analysis. In both periods, cluster 2 has the highest average ridership and run times, and it is also the smallest cluster. We conclude that these are the most popular routes and define them as ‘hotlines’.
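Flagging the hotline cluster can be sketched as follows. The cluster averages and route IDs are made up for illustration, and picking the cluster by its mean load alone is an assumption; the report also considers run times and cluster size.

```r
# Hypothetical per-cluster averages (not the report's results)
cluster_means <- data.frame(
  cluster   = 1:3,
  mean_load = c(6.1, 14.8, 4.3)
)

# The cluster with the highest average load is treated as the hotline cluster
hot_cluster <- cluster_means$cluster[which.max(cluster_means$mean_load)]

# Hypothetical route-to-cluster assignments
route_labels <- data.frame(
  ROUTE   = c(101, 102, 103, 104),
  cluster = c(1, 2, 2, 3)
)

# 1 if the route belongs to the hotline cluster, 0 otherwise
route_labels$hotline <- as.integer(route_labels$cluster == hot_cluster)
```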

routeplot1 <- function(n,p,p1,d) {
  # line n before map
  t1 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD)), 
          aes(color = mean_stop_load), size = 0.8)+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, Before CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n before passenger load chart
  t11 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n before passenger boarding and alighting
  t12 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha=0.25) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha=0.25) +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  # line n after map
  t2 = ggplot() +
  geom_sf(data = nhood, color = 'grey30',fill = 'grey20') +
  geom_sf(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p1) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD)), 
          aes(color = mean_stop_load), size = 0.8)+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  labs(title=paste("Line",n,"Direction 1, After CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n after passenger load chart
  t21 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n after passenger boarding and alighting
  t22 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha=0.25) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha=0.25) +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  grid.arrange(arrangeGrob(t1, t11, t12, ncol = 1, nrow = 3),
               arrangeGrob(t2, t21, t22, ncol = 1, nrow = 3),ncol=2)
}
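The load and boarding/alighting charts in `routeplot1` rest on one aggregation: averaging the passenger counts over all observed trips at each stop sequence. A minimal base R sketch of that aggregation, using a made-up table of two trips over three stops (the column names follow the disaggregated data, the values are hypothetical):

```r
# Two hypothetical trips observed at three consecutive stops
trips <- data.frame(
  STOP_SEQ_ID = c(1, 2, 3, 1, 2, 3),
  PSGR_ON     = c(5, 3, 0, 7, 1, 0),
  PSGR_OFF    = c(0, 2, 6, 0, 3, 5),
  PSGR_LOAD   = c(5, 6, 0, 7, 5, 0)
)

# Mean boardings, alightings and load per stop sequence
profile <- aggregate(cbind(PSGR_ON, PSGR_OFF, PSGR_LOAD) ~ STOP_SEQ_ID,
                     data = trips, FUN = mean)
print(profile)
```

Each row of `profile` corresponds to one point on the load and boarding/alighting charts.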

Feature Engineering

Amenities (use buffer)

Open Street Map (OSM) Amenity Counts

######### Get OSM Data #########
# Query OpenStreetMap for one key/value pair within Austin and return the
# matching features as sf points (or polygons when no points are returned),
# projected to EPSG:2278
getOSM <- function(key,value){
  feature <- opq(bbox = 'Austin, Texas')%>%
    add_osm_feature(key = key, value = value) %>%
    osmdata_sf()
  if(is.null(feature$osm_points)){
    # osmdata already returns sf objects, so no st_as_sf() conversion is needed
    feature_poly <- feature$osm_polygons%>%
      select(osm_id,geometry)%>%
      st_transform(2278)
    return(feature_poly)
  } else {
    feature_pt <- feature$osm_points%>%
      select(osm_id,geometry)%>%
      st_transform(2278)
    return(feature_pt)
  }
}

#commercial
commercial <- getOSM('building', 'commercial')
#retail
retail <- getOSM('building', 'retail')
#supermarket
supermkt <- getOSM('building', 'supermarket')
#office
office <- getOSM('building', 'office')
#residential
residential <- getOSM('building','residential')
#bar
bar <- getOSM('amenity', 'bar')
#school
school <- getOSM('amenity', 'school')
#uni
university <- getOSM('amenity', 'university')
#parking
parking <- getOSM('amenity', 'parking')
#stadium
stadium <- getOSM('building', 'stadium')
#trainstation
trainstation <- getOSM('building', 'train_station')

######### spatial join #########
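# Count the features in Points that fall in each stop buffer;
# the count column is named after Name
bufferInit <- function(Buffer, Points, Name){
  # inherits() returns a single TRUE/FALSE; comparing class() with == gives a
  # length-2 vector here, which if() rejects in R >= 4.2
  if(inherits(st_geometry(Points), "sfc_POINT")){
    Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_contains)%>%
      group_by(STOP_ID)%>%
      summarize(count = n())%>%
      rename(!!Name := count)
  }else {
    Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_intersects)%>%
      group_by(STOP_ID)%>%
      summarize(count = n())%>%
      rename(!!Name := count)
  }
  # Note: st_join() is a left join by default, so a buffer without any match
  # still contributes one row to its count
  return(Init)
}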
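One detail of counting after a left join is worth checking. `st_join()` keeps every buffer by default, so a buffer with no amenity still produces one row (with missing amenity attributes), and a plain row count reports 1 instead of 0. The base R sketch below shows the difference on a made-up join result; the table is hypothetical and stands in for the output of the spatial join.

```r
# Hypothetical result of a left join between stop buffers and amenities:
# stop 3 has no amenity, so its single row carries NA in osm_id
joined <- data.frame(
  STOP_ID = c(1, 1, 2, 3),
  osm_id  = c("a", "b", "c", NA)
)

# Row counts per stop, which is what n() returns: stop 3 is counted as 1
rows_per_stop <- tapply(joined$osm_id, joined$STOP_ID, length)

# Counting only non-missing matches gives 0 for stop 3
amenities_per_stop <- tapply(joined$osm_id, joined$STOP_ID,
                             function(x) sum(!is.na(x)))
```

If zero counts matter for the model, replacing the row count with a count of non-missing `osm_id` values is one way to obtain them.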

Amenity Distance

Built Environments

Demographics

######### census #########
options(tigris_use_cache = TRUE)
v17 <- load_variables(2017, "acs5", cache = TRUE)

Hays <- get_acs(state = "48", county = "209", geography = "tract", 
                variables = "B01001_001", geometry = TRUE)
Travis <- get_acs(state = "48", county = "453", geography = "tract", 
                  variables = "B01001_001", geometry = TRUE)
Williamson <- get_acs(state = "48", county = "491", geography = "tract", 
                      variables = "B01001_001", geometry = TRUE) 

Travis_race <- get_acs(state = "48", county = "453", geography = "tract", 
                       variables = "B02001_002", geometry = TRUE)
Williamson_race <- get_acs(state = "48", county = "491", geography = "tract", 
                           variables = "B02001_002", geometry = TRUE) 

Travis_noveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_002", geometry = TRUE)
Williamson_noveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_002", geometry = TRUE)

Travis_oneveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_003", geometry = TRUE)
Williamson_oneveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_003", geometry = TRUE)

Travis_twoveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_004", geometry = TRUE)
Williamson_twoveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_004", geometry = TRUE)

Travis_threeveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_005", geometry = TRUE)
Williamson_threeveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_005", geometry = TRUE)

Travis_fourveh <- get_acs(state = "48", county = "453", geography = "tract", 
                           variables = "B08014_006", geometry = TRUE)
Williamson_fourveh <- get_acs(state = "48", county = "491", geography = "tract", 
                               variables = "B08014_006", geometry = TRUE)

Travis_fiveveh <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B08014_007", geometry = TRUE)
Williamson_fiveveh <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B08014_007", geometry = TRUE)

Travis_poverty <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B06012_002", geometry = TRUE)
Williamson_poverty <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B06012_002", geometry = TRUE)

Travis_medInc <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B19013_001", geometry = TRUE)
Williamson_medInc <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B19013_001", geometry = TRUE)
######### buffer demographics #########
#population
Population <- rbind(Travis, Williamson)%>%
  st_transform(2278)
Population_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Population, sid = GEOID, weight = "sum",
                                  output = "sf", extensive = "estimate")
Population_buff$estimate<- round(Population_buff$estimate)

#race
Race <- rbind(Travis_race, Williamson_race)%>%
  st_transform(2278)
Race_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Race, sid = GEOID, weight = "sum",
                            output = "sf", extensive = "estimate")
Race_buff$estimate <- round(Race_buff$estimate)

#vehicle ownership
NoVeh <- rbind(Travis_noveh, Williamson_noveh)%>%
  st_transform(2278)
NoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = NoVeh, sid = GEOID, weight = "sum",
                             output = "sf", extensive = "estimate")
NoVeh_buff$estimate <- round(NoVeh_buff$estimate)


OneVeh <- rbind(Travis_oneveh, Williamson_oneveh)%>%
  st_transform(2278)
OneVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = OneVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
OneVeh_buff$estimate <- round(OneVeh_buff$estimate)


TwoVeh <- rbind(Travis_twoveh, Williamson_twoveh)%>%
  st_transform(2278)
TwoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = TwoVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
TwoVeh_buff$estimate <- round(TwoVeh_buff$estimate)


ThreeVeh <- rbind(Travis_threeveh, Williamson_threeveh)%>%
  st_transform(2278)
ThreeVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = ThreeVeh, sid = GEOID, weight = "sum",
                                output = "sf", extensive = "estimate")
ThreeVeh_buff$estimate <- round(ThreeVeh_buff$estimate)


FourVeh <- rbind(Travis_fourveh, Williamson_fourveh)%>%
  st_transform(2278)
FourVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FourVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FourVeh_buff$estimate <- round(FourVeh_buff$estimate)


FiveVeh <- rbind(Travis_fiveveh, Williamson_fiveveh)%>%
  st_transform(2278)

FiveVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FiveVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FiveVeh_buff$estimate <- round(FiveVeh_buff$estimate)


#poverty
Poverty <- rbind(Travis_poverty, Williamson_poverty)%>%
  st_transform(2278)
Poverty_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Poverty, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
Poverty_buff$estimate <- round(Poverty_buff$estimate)


#MedInc
medInc <- rbind(Travis_medInc, Williamson_medInc)%>%
  st_transform(2278)
medInc_stop <- st_join(stops, medInc, join = st_intersects)
medInc_stop <- medInc_stop %>%
  st_drop_geometry() %>%
  select(STOP_ID, estimate) %>%
  rename(medInc = estimate)
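The `aw_interpolate()` calls above apportion each tract-level estimate to the stop buffers by area. The arithmetic can be sketched with made-up numbers as follows. The tract populations, tract areas and overlap areas are hypothetical, and the sketch divides by each tract's full area. As far as we understand the areal documentation, that corresponds to `weight = "total"`, whereas `weight = "sum"` divides by the summed area of the intersected pieces of each tract, so the two can differ when the buffers cover only part of a tract.

```r
# Hypothetical tracts: population estimate and total area (square feet)
tracts <- data.frame(
  GEOID    = c("A", "B"),
  estimate = c(4000, 1000),
  area     = c(2e6, 1e6)
)

# Hypothetical overlap between one stop buffer and each tract
overlap <- data.frame(
  GEOID        = c("A", "B"),
  overlap_area = c(5e5, 2.5e5)
)

m <- merge(tracts, overlap, by = "GEOID")

# Weight = share of each tract's area that falls inside the buffer
m$weight <- m$overlap_area / m$area

# An extensive variable (a count) is allocated in proportion to the weight
buffer_pop <- round(sum(m$estimate * m$weight))
buffer_pop
```

Median income is an intensive variable (a rate or average rather than a count), which is why it is attached to stops with a plain spatial join instead of being summed by area.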

Route Network

Use datasets after Cap Remap

Join All Features

all_x1 <- CommercialInit %>%  #amenities and route related
  left_join(st_drop_geometry(RetailInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OfficeInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ResidentialInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SupermktInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(BarInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(UniInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ParkingInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SchoolInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StationInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StadiumInit), by = "STOP_ID") %>%
  left_join(stop_dir_freq, by = "STOP_ID") %>%
  left_join(stop_type_freq, by = "STOP_ID") %>%
  left_join(stop_hot_freq, by = "STOP_ID") %>%
  left_join(build_dens, by = "STOP_ID") %>%
  left_join(st_drop_geometry(stop_buff_landuse), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Race_buff) %>% rename(race = estimate) %>% select(STOP_ID, race) %>% mutate(STOP_ID = as.numeric(STOP_ID), race = as.numeric(race)), by = "STOP_ID") %>% #census data
  left_join(st_drop_geometry(Population_buff) %>%  rename(population = estimate) %>% select(STOP_ID, population) %>% mutate(STOP_ID = as.numeric(STOP_ID), population = as.numeric(population)), by = "STOP_ID") %>% 
  left_join(st_drop_geometry(NoVeh_buff) %>%  rename(NoVeh = estimate) %>% select(STOP_ID, NoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), NoVeh = as.numeric(NoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OneVeh_buff) %>%  rename(OneVeh = estimate) %>% select(STOP_ID, OneVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), OneVeh = as.numeric(OneVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(TwoVeh_buff) %>%  rename(TwoVeh = estimate) %>% select(STOP_ID, TwoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), TwoVeh = as.numeric(TwoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ThreeVeh_buff) %>%  rename(ThreeVeh = estimate) %>% select(STOP_ID, ThreeVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), ThreeVeh = as.numeric(ThreeVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FourVeh_buff) %>%  rename(FourVeh = estimate) %>% select(STOP_ID, FourVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FourVeh = as.numeric(FourVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FiveVeh_buff) %>%  rename(FiveVeh = estimate) %>% select(STOP_ID, FiveVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FiveVeh = as.numeric(FiveVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Poverty_buff) %>%  rename(Poverty = estimate) %>% select(STOP_ID, Poverty) %>% mutate(STOP_ID = as.numeric(STOP_ID), Poverty = as.numeric(Poverty)), by = "STOP_ID") %>%
  left_join(medInc_stop, by= "STOP_ID") %>%
  left_join(st_drop_geometry(stop_nhood), by = "STOP_ID") %>% # fixed effects
  left_join(st_drop_geometry(stop_school), by = "STOP_ID") %>%
  select(-c(hotline_0)) %>%
  left_join(data.2019.mean, by = "STOP_ID") %>%
  left_join(morning, by = "STOP_ID")%>%
  #left_join(afternoon, by = "STOP_ID")%>%
  left_join(evening, by = "STOP_ID") %>%
  left_join(route_fix, by = c("STOP_ID" = "r")) %>%
  left_join(bus_count, by = "STOP_ID")

#spatial lag, knn dist
all_x3 = bind_cols(list(all_x1, utaustinDist, CBDDist, commercialDist, retailDist, supermktDist, officeDist, residentialDist, schoolDist, universityDist, parkingDist, stadiumDist, trainstationDist, airportDist, spatial_lag %>% select(spatial_lag2, spatial_lag3, spatial_lag5)))
#recategorize variables
all_x4_normalize <-
  all_x3 %>% 
  mutate(Clockwise_cat = case_when(
      Clockwise == 0 ~ "0",
      Clockwise == 1 ~ "1",
      Clockwise > 0 & Clockwise <1 ~ "others"),
    Counterclockwise_cat = case_when(
      Counterclockwise == 0 ~ "0",
      Counterclockwise == 1 ~ "1",
      Counterclockwise > 0 & Counterclockwise <1 ~ "others"),
    Crosstown_cat = case_when(
      Crosstown == 0 ~ "0",
      Crosstown == 1 ~ "1",
      Crosstown > 0 & Crosstown <1 ~ "others"),
    Express_cat = case_when(
      Express == 0 ~ "0",
      Express == 1 ~ "1",
      Express > 0 & Express <1 ~ "others"),
    Feeder_cat = case_when(
      Feeder == 0 ~ "0",
      Feeder == 1 ~ "1",
      Feeder > 0 & Feeder <1 ~ "others"),
    Flyer_cat = case_when(
      Flyer == 0 ~ "0",
      Flyer == 1 ~ "1",
      Flyer > 0 & Flyer <1 ~ "others"),
    HighFreq_cat = case_when(
      `High Frequency` == 0 ~ "0",
      `High Frequency` == 1 ~ "1",
      `High Frequency` > 0 & `High Frequency` <1 ~ "others"),
    hotline_cat = case_when(
      hotline_1 == 0 ~ "0",
      hotline_1 == 1 ~ "1",
      hotline_1 > 0 & hotline_1 <1 ~ "others"),
    InOut_cat = case_when(
      InOut == 0 ~ "0",
      InOut == 1 ~ "1",
      InOut > 0 & InOut <1 ~ "others"),
    Local_cat = case_when(
      Local == 0 ~ "0",
      Local == 1 ~ "1",
      Local > 0 & Local <1 ~ "others"),
    NightOwl_cat = case_when(
      `Night Owl` == 0 ~ "0",
      `Night Owl` == 1 ~ "1",
      `Night Owl` > 0 & `Night Owl` <1 ~ "others"),
    SN_cat = case_when(
      SouthNorth == 0 ~ "0",
      SouthNorth == 1 ~ "1",
      SouthNorth > 0 & SouthNorth <1 ~ "others"),
    Special_cat = case_when(
      Special == 0 ~ "0",
      Special == 1 ~ "1",
      Special > 0 & Special <1 ~ "others"),
    utshuttle_cat = case_when(
      `UT Shuttle` == 0 ~ "0",
      `UT Shuttle` == 1 ~ "1",
      `UT Shuttle` > 0 & `UT Shuttle` <1 ~ "others"),
    WE_cat = case_when(
      WestEast == 0 ~ "0",
      WestEast == 1 ~ "1",
      WestEast > 0 & WestEast <1 ~ "others"))
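The fifteen `case_when()` blocks above all apply the same rule: a share of exactly 0 becomes "0", a share of exactly 1 becomes "1", and anything strictly in between becomes "others". A small helper makes the rule explicit and avoids copy-paste slips. This is a base R sketch, not the code used for the report; `share_to_cat` and the demo table are hypothetical.

```r
# Recode a share in [0, 1] into "0", "1" or "others";
# missing or out-of-range values stay NA, as with case_when()
share_to_cat <- function(x) {
  out <- rep(NA_character_, length(x))
  out[!is.na(x) & x == 0] <- "0"
  out[!is.na(x) & x == 1] <- "1"
  out[!is.na(x) & x > 0 & x < 1] <- "others"
  out
}

# Hypothetical shares for two route attributes
demo <- data.frame(
  Flyer = c(0, 0.4, 1, NA),
  InOut = c(1, 0, 0.5, 0.2)
)

# Apply the same recoding to every listed column
for (v in c("Flyer", "InOut")) {
  demo[[paste0(v, "_cat")]] <- share_to_cat(demo[[v]])
}
```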

Features Explorations

Change features and see results

# plot original predictions
lmreg <- lm(mean_on ~ .,data = all_x4_normalize %>% st_drop_geometry() %>% select(building_area, civic, commercial, residential, industrial, SN_cat, Crosstown_cat, Express_cat, Local_cat, Flyer_cat, NightOwl_cat, HighFreq_cat, InOut_cat,Clockwise_cat, hotline_1,utshuttle_cat, Special_cat, school_count, stadium_count, medInc, nshifts, mean_on))
summary(lmreg)

lm_model0 <-
  all_x4_normalize %>%
  mutate(ridership.Predict = predict(lmreg, all_x4_normalize)) %>%
  mutate(pred_err = ridership.Predict-mean_on,
         pred_err_p = (ridership.Predict-mean_on)/mean_on)

grid.arrange(
ggplot()+
  geom_sf(data = nhood, color = 'grey40',fill = 'grey40') +
  geom_sf(data = st_centroid(na.omit(lm_model0)), aes(color = pred_err),size = 0.9) +
  scale_color_gradientn(colors = c("#b2182b", "#ef8a62", "#fddbc7","#d1d1d1","#67a9cf", "#2166ac"), limits = c(-750,500))+
  labs(title = "Ridership Prediction Error") +
  mapTheme(),

ggplot()+
  geom_sf(data = nhood, color = 'grey40',fill = 'grey40') +
  geom_sf(data = st_centroid(na.omit(lm_model0)), aes(color = pred_err_p),size = 0.9) +
  scale_color_gradientn(colors = c("#b2182b","#ef8a62", "#d1d1d1","#67a9cf","#2166ac"), limits = c(-40,40))+
  labs(title = "Ridership Prediction Error Percentage") +
  mapTheme(),ncol=2)

#agg nhood
nhood0 <- nhood %>% left_join(lm_model0 %>% 
            na.omit() %>%
            group_by(label) %>% 
            summarise(Pred.err.sum = sum(pred_err), total.ridership = sum(mean_on)) %>% 
            select(label, Pred.err.sum, total.ridership) %>%
            st_drop_geometry(), by = "label")

ggplot()+
  geom_sf(data = nhood0, aes(fill = Pred.err.sum)) +
  labs(title = "Prediction Error by Neighborhood") +
  scale_fill_gradientn(colors = c("#b2182b", "#f4a582", "#f7f7f7", "#2166ac"), limits = c(-5600,3000))+
  mapTheme()

ggplot()+
  geom_sf(data = nhood0, aes(fill = total.ridership)) +
  labs(title = "Ridership by Neighborhood") +
  mapTheme()

Building area SCENARIO 1:

Building area SCENARIO 2:

Building area SCENARIO 3:

Modeling and Validation

Use Neighborhood For Validation with Four Types of Models

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "MAPE of each model on the testing set") +
  theme_bw()
#MAE chart
ggplot(data = val_preds %>% 
           dplyr::select(model, MAE) %>% 
           distinct() , 
         aes(x = model, y = MAE, group = 1)) +
    geom_path(color = "blue") +
    geom_label(aes(label = paste0(round(MAE,1)))) +
    labs(title= "MAE of each model on the testing set") +
  theme_bw()  
  
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") +
  theme_bw()

#Neighborhood validation
val_MAPE_by_hood <- val_preds %>% 
  group_by(label, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
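The three metrics computed above can be written out by hand. The sketch below uses made-up observed and predicted values. It assumes, as in yardstick, that MAPE is reported in percent, and it shows why MAPE is undefined for a stop with zero observed ridership (division by zero).

```r
# Hypothetical observed and predicted average boardings at four stops
observed  <- c(10, 20, 40, 5)
predicted <- c(12, 18, 30, 6)

err <- predicted - observed

mae  <- mean(abs(err))                    # mean absolute error, in riders
rmse <- sqrt(mean(err^2))                 # penalizes large errors more heavily
mape <- mean(abs(err) / observed) * 100   # mean absolute percentage error

c(MAE = mae, RMSE = rmse, MAPE = mape)
```

Because MAPE scales each error by the observed value, low-ridership stops can dominate it, while MAE and RMSE are driven by high-ridership stops.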
#Barchart of the MAE in each neighborhood
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


val_MAPE_by_hood %>%
  dplyr::select(label, model, MAE) %>%
  gather(Variable, MAE, -model, -label) %>%
  ggplot(aes(label, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette4) +
  facet_wrap(~label,scales="free", ncol=6)+
  labs(title = "Mean Absolute Errors by model specification and neighborhood") +
  plotTheme()

#Map of MAE in each neighborhood
#Add geometry to the MAE
MAE.nhood <- merge(nhood, val_MAPE_by_hood, by.x="label", by.y="label", all.y=TRUE)

#Produce the map

#Map: MAPE of lm
MAE.nhood%>%
  filter(model=="lm") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of lm in Neighborhoods") +
  mapTheme()

#Map: MAPE of glmnet
MAE.nhood%>%
  filter(model=="glmnet") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of glmnet in Neighborhoods") +
  mapTheme()
#MAPE of rf
MAE.nhood%>%
  filter(model=="rf") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of rf in Neighborhoods") +
  mapTheme()

#MAPE of xgb
MAE.nhood%>%
  filter(model=="xgb") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of xgb in Neighborhoods") +
  mapTheme()

Testing different buffer sizes for model accuracy and generalizability

Buffer size 1/2 mile

#1/2 Buffer Size with Typology Test
data.half <- join(all_half, typology, type ="left")
data.half$STOP_ID <- NULL
data.half<-data.half %>%
  drop_na()
data.half$universityDist1<-NULL
#Split the data into training and testing sets
data_split.half <- rsample::initial_split(data.half, strata = "mean_on", prop = 0.75)

bus_train.half <- rsample::training(data_split.half)
bus_test.half  <- rsample::testing(data_split.half)
names(bus_train.half)


cv_splits_geo.half <- rsample::group_vfold_cv(bus_train.half,  strata = "mean_on", group = "typology")
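`group_vfold_cv()` holds out whole groups rather than individual rows, so each fold tests how well the model transfers to a typology it has not seen. A minimal base R sketch of that idea, assuming one fold per group (leave-one-group-out); the stop table and the three typology labels are hypothetical.

```r
set.seed(1)

# Hypothetical stops, four per typology
stops <- data.frame(
  STOP_ID  = 1:12,
  typology = rep(c("CBD", "UT", "Rest"), each = 4),
  mean_on  = round(runif(12, 1, 50))
)

# One fold per typology: fit on the other groups, assess on the held-out group
folds <- lapply(unique(stops$typology), function(g) {
  list(
    held_out   = g,
    analysis   = which(stops$typology != g),
    assessment = which(stops$typology == g)
  )
})

# Row indices used for fitting and for assessment in the first fold
folds[[1]]$analysis
folds[[1]]$assessment
```

No stop appears in both the analysis and the assessment set of a fold, which is what separates this scheme from ordinary random k-fold cross-validation.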

#Create recipe
model_rec.half <- recipe(mean_on ~ ., data = bus_train.half) %>% #the "." means using every variable we have in the training dataset
  update_role(typology, new_role = "typology") %>% #Gives typology its own role so that it is not used as a predictor
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.half

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter grids for tuning
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.half <-
  workflows::workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(lm_plan)

glmnet_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(glmnet_plan)

rf_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(rf_plan)
xgb_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
library(tune)
library(yardstick)

lm_tuned.half <- lm_wf.half %>%
  fit_resamples(.,
                resamples = cv_splits_geo.half,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.half <- glmnet_wf.half %>%
  tune_grid(.,
            resamples = cv_splits_geo.half,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.half <- rf_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.half <- xgb_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned.half, metric = "rmse", n = 15, maximize = FALSE)

lm_best_params.half     <- select_best(lm_tuned.half, metric = "rmse", maximize = FALSE)
glmnet_best_params.half <- select_best(glmnet_tuned.half, metric = "rmse", maximize = FALSE)
rf_best_params.half     <- select_best(rf_tuned.half, metric = "rmse", maximize = FALSE)
xgb_best_params.half    <- select_best(xgb_tuned.half, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf.half     <- finalize_workflow(lm_wf.half, lm_best_params.half)
glmnet_best_wf.half <- finalize_workflow(glmnet_wf.half, glmnet_best_params.half)
rf_best_wf.half     <- finalize_workflow(rf_wf.half, rf_best_params.half)
xgb_best_wf.half    <- finalize_workflow(xgb_wf.half, xgb_best_params.half)

lm_val_fit_geo.half <- lm_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.half <- glmnet_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.half <- rf_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.half <- xgb_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.half <- collect_predictions(lm_tuned.half) 
glmnet_best_OOF_preds.half <- collect_predictions(glmnet_tuned.half) %>% 
  filter(penalty  == glmnet_best_params.half$penalty[1] & mixture == glmnet_best_params.half$mixture[1])
rf_best_OOF_preds.half <- collect_predictions(rf_tuned.half) %>% 
  filter(mtry  == rf_best_params.half$mtry[1] & min_n == rf_best_params.half$min_n[1])

xgb_best_OOF_preds.half <- collect_predictions(xgb_tuned.half) %>% 
  filter(mtry  == xgb_best_params.half$mtry[1] & min_n == xgb_best_params.half$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.half     <- collect_predictions(lm_val_fit_geo.half)
glmnet_val_pred_geo.half <- collect_predictions(glmnet_val_fit_geo.half)
rf_val_pred_geo.half     <- collect_predictions(rf_val_fit_geo.half)
xgb_val_pred_geo.half    <- collect_predictions(xgb_val_fit_geo.half)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.half <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.half, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.half, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.half, .pred, mean_on), model = "rf"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.half, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()


# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
val_preds.half <- rbind(data.frame(lm_val_pred_geo.half, model = "lm"),
                   data.frame(glmnet_val_pred_geo.half, model = "glmnet"),
                   data.frame(rf_val_pred_geo.half, model = "rf"),
                   data.frame(xgb_val_pred_geo.half, model = "xgb")) %>% 
  left_join(., data.half %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


#RMSE, MAE and MAPE are columns of val_preds.half (one value per model),
#not of the individual *_val_pred_geo.half tables
val_preds.half %>%
  dplyr::select(model, RMSE, MAE, MAPE) %>%
  distinct()
#The *_val_pred_geo.half tables hold mean_on and .pred on the log scale,
#so the checks below are log-scale metrics
#Rsquared
1- sum((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred) ^ 2)/sum((lm_val_pred_geo.half$mean_on - mean(lm_val_pred_geo.half$mean_on)) ^ 2)
rsq(lm_val_pred_geo.half, mean_on, .pred)
rsq(glmnet_val_pred_geo.half, mean_on, .pred)
rsq(rf_val_pred_geo.half, mean_on, .pred)
rsq(xgb_val_pred_geo.half, mean_on, .pred)
#MAE and MAPE
mean(abs(lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred))
sd(abs(lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred))
mean(abs((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)/lm_val_pred_geo.half$mean_on))
sd(abs((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)/lm_val_pred_geo.half$mean_on))

mean(abs(glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred))
sd(abs(glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred))
mean(abs((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)/glmnet_val_pred_geo.half$mean_on))
sd(abs((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)/glmnet_val_pred_geo.half$mean_on))

mean(abs(rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred))
sd(abs(rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred))
mean(abs((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)/rf_val_pred_geo.half$mean_on))
sd(abs((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)/rf_val_pred_geo.half$mean_on))

mean(abs(xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred))
sd(abs(xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred))
mean(abs((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)/xgb_val_pred_geo.half$mean_on))
sd(abs((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)/xgb_val_pred_geo.half$mean_on))
#RMSE
sqrt(mean((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)^2))
sqrt(mean((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)^2))
sqrt(mean((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)^2))
sqrt(mean((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)^2))
sqrt(sd((lm_val_pred_geo.half$mean_on - lm_val_pred_geo.half$.pred)^2))
sqrt(sd((glmnet_val_pred_geo.half$mean_on - glmnet_val_pred_geo.half$.pred)^2))
sqrt(sd((rf_val_pred_geo.half$mean_on - rf_val_pred_geo.half$.pred)^2))
sqrt(sd((xgb_val_pred_geo.half$mean_on - xgb_val_pred_geo.half$.pred)^2))

yardstick::rmse_vec(lm_val_pred_geo.half$mean_on, lm_val_pred_geo.half$.pred)
yardstick::mape_vec(lm_val_pred_geo.half$mean_on, lm_val_pred_geo.half$.pred)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.half %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/2-mi Buffer, MAPE of each model on the testing set with typology") +
theme_bw()
#MAE chart
ggplot(data = val_preds.half%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/2 mi Buffer, MAE of each model on the testing set with typology") +
theme_bw()  
#RMSE
ggplot(data = val_preds.half%>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/2 mi Buffer, RMSE of each model on the testing set with typology") +
theme_bw() 
#Predicted vs Observed
ggplot(val_preds.half, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/2 Mile: Predicted vs Observed on the testing set", subtitle= "Blue line: linear fit of observed on predicted; red dashed line: perfect prediction") +
theme_bw()

#Neighborhood validation
val_MAPE_by_typology.half <- val_preds.half %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.half)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/2 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Buffer size: 1/4 mile

#1/4 mile buffer with ridership: data.quarter
#typology
sum(is.na(All_3))
sum(is.na(all_2))
All_3$typology
summary(all_2$parkingDist)
typology<- All_3 %>%
  dplyr::select(STOP_ID, typology)
typology$typology <- ifelse(typology$typology == "CBD" , 'CBD',
                                ifelse(typology$typology == "UT", 'UT',
                                       ifelse(typology$typology == "UT&CBD", 'CBD', 'Rest')))

#write.csv(typology, "C:/Upenn/Practicum/Data/Typology_withSTOP_ID.csv")
names(all_2)
data.quarter <- plyr::join(all_2, typology, type ="left")
data.quarter$STOP_ID <- NULL

data.quarter<-data.quarter %>%
  drop_na()
data.quarter$universityDist1<-NULL
#Split the data into training and testing sets
data_split.quarter <- rsample::initial_split(data.quarter, strata = "mean_on", prop = 0.75)

bus_train.quarter <- rsample::training(data_split.quarter)
bus_test.quarter  <- rsample::testing(data_split.quarter)
names(bus_train.quarter)


cv_splits_geo.quarter <- rsample::group_vfold_cv(bus_train.quarter,  strata = "mean_on", group = "typology")
print(cv_splits_geo.quarter)

#Create recipe
model_rec.quarter <- recipe(mean_on ~ ., data = bus_train.quarter) %>% #the "." means using every variable we have in the training dataset
  update_role(typology, new_role = "typology") %>% #give typology a non-predictor role so it stays out of the model
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.quarter

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter grids for tuning
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.quarter <-
  workflows::workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(lm_plan)

glmnet_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(glmnet_plan)

rf_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(rf_plan)
xgb_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
library(tune)
library(yardstick)

lm_tuned.quarter <- lm_wf.quarter %>%
  fit_resamples(.,
                resamples = cv_splits_geo.quarter,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.quarter <- glmnet_wf.quarter %>%
  tune_grid(.,
            resamples = cv_splits_geo.quarter,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.quarter <- rf_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.quarter <- xgb_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)

lm_best_params.quarter     <- select_best(lm_tuned.quarter, metric = "rmse", maximize = FALSE)
glmnet_best_params.quarter <- select_best(glmnet_tuned.quarter, metric = "rmse", maximize = FALSE)
rf_best_params.quarter     <- select_best(rf_tuned.quarter, metric = "rmse", maximize = FALSE)
xgb_best_params.quarter    <- select_best(xgb_tuned.quarter, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf.quarter     <- finalize_workflow(lm_wf.quarter, lm_best_params.quarter)
glmnet_best_wf.quarter <- finalize_workflow(glmnet_wf.quarter, glmnet_best_params.quarter)
rf_best_wf.quarter     <- finalize_workflow(rf_wf.quarter, rf_best_params.quarter)
xgb_best_wf.quarter    <- finalize_workflow(xgb_wf.quarter, xgb_best_params.quarter)

lm_val_fit_geo.quarter <- lm_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.quarter <- glmnet_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.quarter <- rf_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.quarter <- xgb_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.quarter <- collect_predictions(lm_tuned.quarter) 
glmnet_best_OOF_preds.quarter <- collect_predictions(glmnet_tuned.quarter) %>% 
  filter(penalty  == glmnet_best_params.quarter$penalty[1] & mixture == glmnet_best_params.quarter$mixture[1])
rf_best_OOF_preds.quarter <- collect_predictions(rf_tuned.quarter) %>% 
  filter(mtry  == rf_best_params.quarter$mtry[1] & min_n == rf_best_params.quarter$min_n[1])

xgb_best_OOF_preds.quarter <- collect_predictions(xgb_tuned.quarter) %>% 
  filter(mtry  == xgb_best_params.quarter$mtry[1] & min_n == xgb_best_params.quarter$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.quarter     <- collect_predictions(lm_val_fit_geo.quarter)
glmnet_val_pred_geo.quarter <- collect_predictions(glmnet_val_fit_geo.quarter)
rf_val_pred_geo.quarter     <- collect_predictions(rf_val_fit_geo.quarter)
xgb_val_pred_geo.quarter    <- collect_predictions(xgb_val_fit_geo.quarter)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
lm_best_OOF_preds.quarter$mean_on <- as.numeric(lm_best_OOF_preds.quarter$mean_on)
glmnet_best_OOF_preds.quarter$mean_on <- as.numeric(glmnet_best_OOF_preds.quarter$mean_on)
rf_best_OOF_preds.quarter$mean_on <- as.numeric(rf_best_OOF_preds.quarter$mean_on)
xgb_best_OOF_preds.quarter$mean_on <- as.numeric(xgb_best_OOF_preds.quarter$mean_on)

OOF_preds.quarter <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.quarter, .pred, mean_on), model = "lm"),
                   data.frame(dplyr::select(glmnet_best_OOF_preds.quarter, .pred, mean_on), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds.quarter, .pred, mean_on), model = "rf"),
                   data.frame(dplyr::select(xgb_best_OOF_preds.quarter, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
         #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()


# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
val_preds.quarter <- rbind(data.frame(lm_val_pred_geo.quarter, model = "lm"),
                   data.frame(glmnet_val_pred_geo.quarter, model = "glmnet"),
                   data.frame(rf_val_pred_geo.quarter, model = "rf"),
                   data.frame(xgb_val_pred_geo.quarter, model = "xgb")) %>% 
  left_join(., data.quarter %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


#RMSE, MAE and MAPE are columns of val_preds.quarter (one value per model),
#not of the individual *_val_pred_geo.quarter tables
val_preds.quarter %>%
  dplyr::select(model, RMSE, MAE, MAPE) %>%
  distinct()
#The *_val_pred_geo.quarter tables hold mean_on and .pred on the log scale,
#so the checks below are log-scale metrics
#Rsquared
1- sum((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred) ^ 2)/sum((lm_val_pred_geo.quarter$mean_on - mean(lm_val_pred_geo.quarter$mean_on)) ^ 2)
rsq(lm_val_pred_geo.quarter, mean_on, .pred)
rsq(glmnet_val_pred_geo.quarter, mean_on, .pred)
rsq(rf_val_pred_geo.quarter, mean_on, .pred)
rsq(xgb_val_pred_geo.quarter, mean_on, .pred)
#MAE and MAPE
mean(abs(lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred))
sd(abs(lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred))
mean(abs((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)/lm_val_pred_geo.quarter$mean_on))
sd(abs((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)/lm_val_pred_geo.quarter$mean_on))

mean(abs(glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred))
sd(abs(glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred))
mean(abs((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)/glmnet_val_pred_geo.quarter$mean_on))
sd(abs((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)/glmnet_val_pred_geo.quarter$mean_on))

mean(abs(rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred))
sd(abs(rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred))
mean(abs((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)/rf_val_pred_geo.quarter$mean_on))
sd(abs((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)/rf_val_pred_geo.quarter$mean_on))

mean(abs(xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred))
sd(abs(xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred))
mean(abs((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)/xgb_val_pred_geo.quarter$mean_on))
sd(abs((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)/xgb_val_pred_geo.quarter$mean_on))
#RMSE
sqrt(mean((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)^2))
sqrt(mean((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)^2))
sqrt(mean((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)^2))
sqrt(mean((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)^2))
sqrt(sd((lm_val_pred_geo.quarter$mean_on - lm_val_pred_geo.quarter$.pred)^2))
sqrt(sd((glmnet_val_pred_geo.quarter$mean_on - glmnet_val_pred_geo.quarter$.pred)^2))
sqrt(sd((rf_val_pred_geo.quarter$mean_on - rf_val_pred_geo.quarter$.pred)^2))
sqrt(sd((xgb_val_pred_geo.quarter$mean_on - xgb_val_pred_geo.quarter$.pred)^2))

yardstick::rmse_vec(lm_val_pred_geo.quarter$mean_on, lm_val_pred_geo.quarter$.pred)
yardstick::mape_vec(lm_val_pred_geo.quarter$mean_on, lm_val_pred_geo.quarter$.pred)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/4-mi Buffer, MAPE of each model on the testing set with typology") +
theme_bw()
#MAE chart
ggplot(data = val_preds.quarter%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/4 mi Buffer, MAE of each model on the testing set with typology") +
theme_bw()  
#RMSE
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/4 mi Buffer, RMSE of each model on the testing set with typology") +
theme_bw() 
#Predicted vs Observed
ggplot(val_preds.quarter, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/4 Mile: Predicted vs Observed on the testing set", subtitle= "Blue line: linear fit of observed on predicted; red dashed line: perfect prediction") +
theme_bw()

#Neighborhood validation
val_MAPE_by_typology.quarter <- val_preds.quarter %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.quarter)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/4 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Buffer size: 1/8 mile

#1/8-mile buffer
data.eighth <- plyr::join(all_eighth, typology, type ="left")
data.eighth$STOP_ID <- NULL

data.eighth<-data.eighth %>%
  drop_na()
data.eighth$universityDist1<-NULL
#Split the data into training and testing sets
data_split.eighth <- rsample::initial_split(data.eighth, strata = "mean_on", prop = 0.75)

bus_train.eighth <- rsample::training(data_split.eighth)
bus_test.eighth  <- rsample::testing(data_split.eighth)
names(bus_train.eighth)


cv_splits_geo.eighth <- rsample::group_vfold_cv(bus_train.eighth,  strata = "mean_on", group = "typology")
print(cv_splits_geo.eighth)

#Create recipe
model_rec.eighth <- recipe(mean_on ~ ., data = bus_train.eighth) %>% #the "." means using every variable we have in the training dataset
  update_role(typology, new_role = "typology") %>% #give typology a non-predictor role so it stays out of the model
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.eighth

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter grids for tuning
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.eighth <-
  workflows::workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(lm_plan)

glmnet_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(glmnet_plan)

rf_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(rf_plan)
xgb_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)
library(tune)
library(yardstick)

lm_tuned.eighth <- lm_wf.eighth %>%
  fit_resamples(.,
                resamples = cv_splits_geo.eighth,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.eighth <- glmnet_wf.eighth %>%
  tune_grid(.,
            resamples = cv_splits_geo.eighth,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.eighth <- rf_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.eighth <- xgb_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

# show_best()/select_best() already know that lower RMSE is better,
# so the deprecated `maximize` argument is dropped
show_best(lm_tuned.eighth, metric = "rmse", n = 15)
show_best(glmnet_tuned.eighth, metric = "rmse", n = 15)
show_best(rf_tuned.eighth, metric = "rmse", n = 15)
show_best(xgb_tuned.eighth, metric = "rmse", n = 15)

lm_best_params.eighth     <- select_best(lm_tuned.eighth, metric = "rmse")
glmnet_best_params.eighth <- select_best(glmnet_tuned.eighth, metric = "rmse")
rf_best_params.eighth     <- select_best(rf_tuned.eighth, metric = "rmse")
xgb_best_params.eighth    <- select_best(xgb_tuned.eighth, metric = "rmse")
#Final workflow
lm_best_wf.eighth     <- finalize_workflow(lm_wf.eighth, lm_best_params.eighth)
glmnet_best_wf.eighth <- finalize_workflow(glmnet_wf.eighth, glmnet_best_params.eighth)
rf_best_wf.eighth     <- finalize_workflow(rf_wf.eighth, rf_best_params.eighth)
xgb_best_wf.eighth    <- finalize_workflow(xgb_wf.eighth, xgb_best_params.eighth)

lm_val_fit_geo.eighth <- lm_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.eighth <- glmnet_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.eighth <- rf_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.eighth <- xgb_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.eighth <- collect_predictions(lm_tuned.eighth) 
glmnet_best_OOF_preds.eighth <- collect_predictions(glmnet_tuned.eighth) %>% 
  filter(penalty  == glmnet_best_params.eighth$penalty[1] & mixture == glmnet_best_params.eighth$mixture[1])
rf_best_OOF_preds.eighth <- collect_predictions(rf_tuned.eighth) %>% 
  filter(mtry  == rf_best_params.eighth$mtry[1] & min_n == rf_best_params.eighth$min_n[1])

xgb_best_OOF_preds.eighth <- collect_predictions(xgb_tuned.eighth) %>% 
  filter(mtry  == xgb_best_params.eighth$mtry[1] & min_n == xgb_best_params.eighth$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.eighth     <- collect_predictions(lm_val_fit_geo.eighth)
glmnet_val_pred_geo.eighth <- collect_predictions(glmnet_val_fit_geo.eighth)
rf_val_pred_geo.eighth     <- collect_predictions(rf_val_fit_geo.eighth)
xgb_val_pred_geo.eighth    <- collect_predictions(xgb_val_fit_geo.eighth)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.eighth <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.eighth, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.eighth, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.eighth, .pred, mean_on), model = "RF"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.eighth, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()
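# Illustrative sketch (toy numbers, not project data): the models are fit on
# log(mean_on), so predictions and observations are exponentiated before the
# errors are computed, which keeps MAE in riders and MAPE in percent.
# The toy_* objects are hypothetical; the two formulas follow the usual
# definitions of mean absolute error and mean absolute percentage error.
toy_log_obs  <- log(c(10, 100, 1000))   # observed boardings on the log scale
toy_log_pred <- log(c(12,  90, 1100))   # model output on the log scale
toy_obs  <- exp(toy_log_obs)            # back-transform to riders
toy_pred <- exp(toy_log_pred)
toy_mae  <- mean(abs(toy_obs - toy_pred))                    # (2 + 10 + 100) / 3
toy_mape <- mean(abs((toy_obs - toy_pred) / toy_obs)) * 100  # (20 + 10 + 10) / 3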


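# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
# Detach plyr only if it is attached, so its verbs do not mask dplyr's
if ("package:plyr" %in% search()) detach("package:plyr", character.only = TRUE)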
val_preds.eighth <- rbind(data.frame(lm_val_pred_geo.eighth, model = "lm"),
                   data.frame(glmnet_val_pred_geo.eighth, model = "glmnet"),
                   data.frame(rf_val_pred_geo.eighth, model = "rf"),
                   data.frame(xgb_val_pred_geo.eighth, model = "xgb")) %>% 
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = mean(abs(mean_on - .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


# Back-transform the lm predictions from the log scale once and attach the
# overall validation-set error metrics, matching the other three models below.
# (typology is joined after the models are combined, and the per-typology
# errors are summarised in val_MAPE_by_typology.eighth.)
lm_val_pred_geo.eighth <- lm_val_pred_geo.eighth %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

glmnet_val_pred_geo.eighth<- glmnet_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

rf_val_pred_geo.eighth<- rf_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

xgb_val_pred_geo.eighth<- xgb_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

val_preds.eighth <- rbind(data.frame(lm_val_pred_geo.eighth, model = "lm"),
                           data.frame(glmnet_val_pred_geo.eighth, model = "glmnet"),
                           data.frame(rf_val_pred_geo.eighth, model = "rf"),
                           data.frame(xgb_val_pred_geo.eighth, model = "xgb"))%>%
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row")
summary(lm_val_pred_geo.eighth$MAE)
summary(glmnet_val_pred_geo.eighth$MAE)
summary(rf_val_pred_geo.eighth$MAE)
summary(xgb_val_pred_geo.eighth$MAE)
summary(lm_val_pred_geo.eighth$MAPE)
summary(glmnet_val_pred_geo.eighth$MAPE)
summary(rf_val_pred_geo.eighth$MAPE)
summary(xgb_val_pred_geo.eighth$MAPE)
summary(lm_val_pred_geo.eighth$RMSE)
summary(glmnet_val_pred_geo.eighth$RMSE)
summary(rf_val_pred_geo.eighth$RMSE)
summary(xgb_val_pred_geo.eighth$RMSE)

#Rsquared
rsq(lm_val_pred_geo.eighth, mean_on, .pred)
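# A single R-squared estimate has no standard deviation; the per-fold spread is
# available from collect_metrics(lm_tuned.eighth, summarize = FALSE)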
rsq(glmnet_val_pred_geo.eighth, mean_on, .pred)
rsq(rf_val_pred_geo.eighth, mean_on, .pred)
rsq(xgb_val_pred_geo.eighth, mean_on, .pred)

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/8-mi Buffer, MAPE of each model on the testing set with typology") +
  theme_bw()
#MAE chart
ggplot(data = val_preds.eighth%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/8 mi Buffer, MAE of each model on the testing set with typology") +
  theme_bw()
#RMSE
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/8 mi Buffer, RMSE of each model on the testing set with typology") +
  theme_bw()
#Predicted vs Observed
ggplot(val_preds.eighth, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/8 Mile: Predicted vs Observed on the testing set", subtitle= "red dashed line is a perfect prediction; blue line is the linear fit") +
  theme_bw()

val_MAPE_by_typology.eighth <- val_preds.eighth %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE for each typology
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.eighth)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/8 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Build the kitchen sink model using selected variables

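library(dplyr)
library(tidyr)     # drop_na()
# install.packages("mltools")  # run once if mltools is not yet installed
library(mltools)   # one_hot()
sce0<-sce %>% drop_na()
sum(is.na(sce0))   # should be 0 after dropping incomplete rows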

sce0 <- plyr::join(sce0, typology, type= "left")

library(data.table)
# One-hot encode all route-type indicator columns in a single call
route_cat_cols <- c("SN_cat", "Crosstown_cat", "Express_cat", "Local_cat",
                    "Flyer_cat", "NightOwl_cat", "HighFreq_cat", "InOut_cat",
                    "Clockwise_cat", "utshuttle_cat", "Special_cat")
sce0 <- 
  as.data.table(sce0)%>%
  one_hot(cols = route_cat_cols,
          dropCols = TRUE)
names(sce0)
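
# Illustrative sketch (base R stand-in, not mltools): one-hot encoding turns a
# factor into one 0/1 indicator column per level. toy_routes is a hypothetical
# three-row table; the real columns and levels come from sce0.
toy_routes  <- data.frame(Local_cat = factor(c("yes", "no", "yes")))
toy_one_hot <- model.matrix(~ Local_cat - 1, data = toy_routes)
colnames(toy_one_hot)  # one indicator column per factor level
rowSums(toy_one_hot)   # every row has exactly one indicator set to 1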

sce0 <- sce0 %>%dplyr::select(-STOP_ID)
data_split.sce0 <- rsample::initial_split(sce0, strata = "mean_on", prop = 0.75)

bus_train.sce0 <- rsample::training(data_split.sce0)
bus_test.sce0  <- rsample::testing(data_split.sce0)
names(bus_train.sce0)

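# Grouped CV holds out whole typologies; `strata` is dropped because stratifying
# on a stop-level outcome does not apply when entire groups are held out
cv_splits_geo.sce0 <- rsample::group_vfold_cv(bus_train.sce0, group = "typology")
print(cv_splits_geo.sce0)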

model_rec.sce0 <- recipe(mean_on ~ ., data = bus_train.sce0) %>% # "." uses every other column in the training data as a predictor
  update_role(typology, new_role = "typology") %>% # non-predictor role: typology is kept for grouping but left out of the model
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))


#Create workflow

rf_wf.sce0 <-
  workflow() %>% 
  add_recipe(model_rec.sce0) %>% 
  add_model(rf_plan)

# fit model to workflow and calculate metrics
# Metrics are changed from rmse + rsq to rsq only
rf_tuned.sce0 <- rf_wf.sce0 %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.sce0,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rsq))

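# R-squared is maximised automatically, so the deprecated `maximize` argument
# is dropped (maximize = FALSE would have asked for the lowest R-squared)
show_best(rf_tuned.sce0, metric = "rsq", n = 15)

rf_best_params.sce0     <- select_best(rf_tuned.sce0, metric = "rsq")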

#Final workflow
rf_best_wf.sce0     <- finalize_workflow(rf_wf.sce0, rf_best_params.sce0)

rf_val_fit_geo.sce0 <- rf_best_wf.sce0 %>% 
  last_fit(split     = data_split.sce0,
           control   = control,
           metrics   = metric_set(rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
rf_best_OOF_preds.sce0 <- collect_predictions(rf_tuned.sce0) %>% 
  filter(mtry  == rf_best_params.sce0$mtry[1] & min_n == rf_best_params.sce0$min_n[1])
# collect validation set predictions from last_fit model
rf_val_pred_geo.sce0     <- collect_predictions(rf_val_fit_geo.sce0)
# Combine the validation-set predictions with the best out-of-fold predictions
library(tidyverse)
library(yardstick)
rf_best_OOF_preds.sce0 <- rf_best_OOF_preds.sce0 %>% dplyr::select(-min_n, -mtry)
val_preds.sce0 <- rbind(data.frame(rf_val_pred_geo.sce0), data.frame(rf_best_OOF_preds.sce0) )%>% 
  left_join(., sce0 %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(typology) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()

val_MAPE_by_typology.sce0 <- val_preds.sce0 %>% 
  group_by(typology) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAPE for each typology


as.data.frame(val_MAPE_by_typology.sce0)%>%
  dplyr::select(typology,MAPE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAPE)) + 
  geom_bar(aes(fill = typology), position = "dodge", stat="identity") +
  ylim(0, 120)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "MAPE of the random forest model") +
  plotTheme()